Welcome everybody to deep learning. So today we want to continue talking about loss functions and optimization, and we want to go into a bit more detail on these interesting problems.
We have a search technique, which is just local search, gradient descent, to try to find a program that runs on these networks such that it can solve some interesting problems such as speech recognition, machine translation, and so on.
So let's talk first about the loss functions.
So loss functions depend on the task, and for different tasks you have different loss functions.
And the two most important ones that we are facing are regression and classification.
So in classification you want to estimate a discrete variable for every input.
So this means that you essentially want to decide, in this two-class problem here on the left, whether a point belongs to the blue dots or the red dots.
So you need to model a decision boundary.
In regression the idea is that you want to model a function that explains your data.
So you have some input variable, let's say x2, and you want to predict x1 from it.
So you compute a function that will produce the appropriate value of x1 for any given x2.
Here in this example you can see this is a line fit.
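To make that line-fit example concrete, here is a minimal sketch (not from the lecture; it assumes NumPy and made-up toy data) that fits a line predicting x1 from x2 and measures how good the fit is with a scalar L2 loss:

```python
import numpy as np

# Hypothetical toy data: predict x1 from x2 with a line fit (regression).
rng = np.random.default_rng(0)
x2 = rng.uniform(-1.0, 1.0, size=50)
x1 = 2.0 * x2 + 0.5 + rng.normal(scale=0.1, size=50)   # noisy line

# Least-squares fit x1 ~ a * x2 + b.
A = np.stack([x2, np.ones_like(x2)], axis=1)
(a, b), *_ = np.linalg.lstsq(A, x1, rcond=None)

# The L2 loss is a single scalar that describes how good the fit is.
l2_loss = np.mean((x1 - (a * x2 + b)) ** 2)
print(a, b, l2_loss)
```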
So we already talked about activation functions, the last activation function, the softmax, the cross-entropy loss, and how we somehow combine them.
And obviously there's a difference between the last activation function in a network and the loss function.
Because the last activation function is applied to the individual samples xm of the batch, and it will also be present at training and testing time.
So the last activation function becomes part of the network and remains there; it produces the output or the prediction, and this is generally a vector in the output vector space.
But such a raw output vector is, I think, very difficult for people to understand; they would not know what they're looking at.
Now the loss function combines all M samples and labels, and in their combination they produce a loss that describes how good the fit is.
This loss is generally a scalar value, and it is only present and only needed during training time.
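As a small illustration of this difference (a sketch with NumPy and hypothetical logits and labels, not code from the lecture), the softmax last activation turns each sample's output into a vector, while the cross-entropy loss collapses all samples and labels into one scalar:

```python
import numpy as np

def softmax(z):
    # Last activation: applied per sample, part of the network at
    # training and testing time; returns a vector for each sample.
    z = z - z.max(axis=1, keepdims=True)      # for numerical stability
    e = np.exp(z)
    return e / e.sum(axis=1, keepdims=True)

def cross_entropy(probs, labels):
    # Loss function: combines all M samples and labels into a single
    # scalar; only needed during training.
    m = probs.shape[0]
    return -np.mean(np.log(probs[np.arange(m), labels] + 1e-12))

logits = np.array([[2.0, 0.5, -1.0],          # hypothetical network outputs
                   [0.1, 1.2,  0.3]])         # for M = 2 samples, 3 classes
labels = np.array([0, 1])                     # ground-truth class indices

probs = softmax(logits)                       # one probability vector per sample
loss = cross_entropy(probs, labels)           # one scalar for the whole batch
print(probs, loss)
```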
Interestingly many of those loss functions can be put in a probabilistic framework and
this leads us then to maximum likelihood estimation.
In maximum likelihood estimation, just as a reminder, we consider everything to be probabilistic.
So we have a set of observations, capital X, that consists of the individual observations x1 to xM.
Then we have the associated labels; they also stem from some distribution, and these labels are then y1 to yM.
And then of course we need a conditional probability density function that somehow describes how y and x are related.
In particular, we can compute the probability of y given some observation x, which is then very useful if you want to decide for a specific class, for example.
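Written out (this decision rule is only implied here, not stated explicitly in the lecture), deciding for a class means picking the one with the highest conditional probability:

$$\hat{y} = \arg\max_{y} \, p(y \mid x)$$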
Now we have to somehow model this data set.
The samples are drawn from some distribution, and if they are independent and identically distributed, then the joint probability for the given data set can simply be written as a product over the individual conditional probabilities.
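As a sketch of that step, with M i.i.d. samples x1, ..., xM and labels y1, ..., yM, the joint probability factorizes as

$$p(y_1, \dots, y_M \mid x_1, \dots, x_M) = \prod_{m=1}^{M} p(y_m \mid x_m),$$

and maximizing this likelihood is equivalent to minimizing the negative log-likelihood

$$-\sum_{m=1}^{M} \log p(y_m \mid x_m),$$

which is the form from which losses such as the cross-entropy can be derived.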